Goto

Collaborating Authors

 communication step




FTTE: Federated Learning on Resource-Constrained Devices

arXiv.org Artificial Intelligence

Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy, but deployment on resource-constrained edge nodes remains challenging due to limited memory, energy, and communication bandwidth. Traditional synchronous and asynchronous FL approaches further suffer from straggler induced delays and slow convergence in heterogeneous, large scale networks. We present FTTE (Federated Tiny Training Engine),a novel semi-asynchronous FL framework that uniquely employs sparse parameter updates and a staleness-weighted aggregation based on both age and variance of client updates. Extensive experiments across diverse models and data distributions - including up to 500 clients and 90% stragglers - demonstrate that FTTE not only achieves 81% faster convergence, 80% lower on-device memory usage, and 69% communication payload reduction than synchronous FL (eg.FedAVG), but also consistently reaches comparable or higher target accuracy than semi-asynchronous (eg.FedBuff) in challenging regimes. These results establish FTTE as the first practical and scalable solution for real-world FL deployments on heterogeneous and predominantly resource-constrained edge devices.


Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

arXiv.org Artificial Intelligence

We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.


Reviews: Robust and Communication-Efficient Collaborative Learning

Neural Information Processing Systems

This paper propose a new algorithm to solve a stochastic optimization problem distributively over the nodes of a graph of computing units. As usual, the algorithm involves a communication step and a computing step at each iteration. The algorithms only involves local computations. To be robust to the effect of stragglers among the computing units, the proposed algorithm imposes a deadline in computation time at each iteration. To avoid the communication cost, only quantized version of the local iterates are exchanged during communication steps.


Learning Multiagent Communication with Backpropagation

Neural Information Processing Systems

Many tasks in AI require the collaboration of multiple agents. Typically, the communication protocol between agents is manually specified and not altered during training. In this paper we explore a simple neural model, called CommNet, that uses continuous communication for fully cooperative tasks. The model consists of multiple agents and the communication between them is learned alongside their policy. We apply this model to a diverse set of tasks, demonstrating the ability of the agents to learn to communicate amongst themselves, yielding improved performance over non-communicative agents and baselines. In some cases, it is possible to interpret the language devised by the agents, revealing simple but effective strategies for solving the task at hand.


Heterogeneous Federated Learning via Personalized Generative Networks

arXiv.org Artificial Intelligence

Federated Learning (FL) allows several clients to construct a common global machine-learning model without having to share their data. FL, however, faces the challenge of statistical heterogeneity between the client's data, which degrades performance and slows down the convergence toward the global model. In this paper, we provide theoretical proof that minimizing heterogeneity between clients facilitates the convergence of a global model for every single client. This becomes particularly important under empirical concept shifts among clients, rather than merely considering imbalanced classes, which have been studied until now. Therefore, we propose a method for knowledge transfer between clients where the server trains client-specific generators. Each generator generates samples for the corresponding client to remove the conflict with other clients' models. Experiments conducted on synthetic and real data, along with a theoretical study, support the effectiveness of our method in constructing a well-generalizable global model by reducing the conflict between local models.


Systems for Parallel and Distributed Large-Model Deep Learning Training

arXiv.org Artificial Intelligence

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore increasingly large neural architectures, with some recent Transformer models spanning hundreds of billions of learnable parameters. These designs have introduced new scale-driven systems challenges for the DL space, such as memory bottlenecks, poor runtime efficiency, and high costs of model development. Efforts to address these issues have explored techniques such as parallelization of neural architectures, spilling data across the memory hierarchy, and memory-efficient data representations. This survey will explore the large-model training systems landscape, highlighting key challenges and the various techniques that have been used to address them.


Finite Sample Guarantees for Distributed Online Parameter Estimation with Communication Costs

arXiv.org Artificial Intelligence

We study the problem of estimating an unknown parameter in a distributed and online manner. Existing work on distributed online learning typically either focuses on asymptotic analysis, or provides bounds on regret. However, these results may not directly translate into bounds on the error of the learned model after a finite number of time-steps. In this paper, we propose a distributed online estimation algorithm which enables each agent in a network to improve its estimation accuracy by communicating with neighbors. We provide non-asymptotic bounds on the estimation error, leveraging the statistical properties of the underlying model. Our analysis demonstrates a trade-off between estimation error and communication costs. Further, our analysis allows us to determine a time at which the communication can be stopped (due to the costs associated with communications), while meeting a desired estimation accuracy. We also provide a numerical example to validate our results.


Hierarchical Federated Learning Across Heterogeneous Cellular Networks

#artificialintelligence

Our proposed HFL framework consists of 4 communication steps: uplink from MU to SBS, downlink from SBS to MU, uplink from SBS to MBS and downlink from MBS to SBS. For each communication step, we employ different sparsification parameters, ϕulMU, ϕdlSBS, ϕulSBS and ϕdlMBS respectively, to speed up the communication. We introduce the function Ω(V,ϕ):Rd Rd, which maps a d dimensional vector to its sparse form where only 1 ϕ portion of the indicies have non-zero values. The sparsification procedure in each step leads to an error in the parameter model and thus slows down the convergence. To overcome this issue we employ the discounted error accumulation technique, similar to [SGD_q4, fedlearn7], which uses the discounted version of the error for the next model update.